From Toy Datasets to Real-World Chaos
EvoClass-AI002 Lecture 5

1. Building the Bridge: Data Loading Fundamentals

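Deep learning models depend on clean, consistent data, but real-world datasets are inherently messy. We must move from prepackaged benchmarks (such as MNIST) to managing unstructured data sources, where data loading is itself a complex orchestration task. The foundation of this process is the set of specialized tools PyTorch provides for data management.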

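The core challenge is turning raw, scattered data stored on disk (images, text, audio files) into the highly organized, standardized PyTorch tensor format that the GPU expects. This requires custom logic for indexing, loading, preprocessing, and finally batching.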

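Key Challenges of Real-World Data

  • Messy data: Data is scattered across multiple directories and is often indexed only by a CSV file.
  • Preprocessing required: Images may need to be resized, normalized, or augmented as they are converted to tensors.
  • Efficiency goal: Data must reach the GPU in optimized, non-blocking batches to maximize training speed.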
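The preprocessing challenge can be sketched with plain torch operations. This is a minimal illustration under stated assumptions, not the course's own code: the function name `preprocess`, the 224×224 target size, and the mean/std values are hypothetical, and a random uint8 tensor stands in for a decoded image.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-channel statistics; in practice they are computed from the training set.
MEAN = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
STD = torch.tensor([0.25, 0.25, 0.25]).view(3, 1, 1)


def preprocess(image_uint8: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Resize a (3, H, W) uint8 image to (3, size, size) and normalize it."""
    x = image_uint8.float() / 255.0  # scale pixel values to [0, 1]
    # interpolate expects a batch dimension, so add one and remove it afterwards
    x = F.interpolate(x.unsqueeze(0), size=(size, size),
                      mode="bilinear", align_corners=False).squeeze(0)
    return (x - MEAN) / STD  # per-channel normalization


# Two stand-in "images" with different raw sizes end up with identical shapes.
a = preprocess(torch.randint(0, 256, (3, 300, 400), dtype=torch.uint8))
b = preprocess(torch.randint(0, 256, (3, 128, 96), dtype=torch.uint8))
print(a.shape, b.shape)
```

Once every sample has the same shape, samples can be stacked into a single batch tensor.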
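PyTorch's Solution: Separation of Responsibilities
PyTorch enforces a separation of concerns: the Dataset handles the "what" (how to access a single sample and its label), while the DataLoader handles the "how" (efficient batching, shuffling, and parallel delivery through multiple worker processes).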
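A minimal sketch of this split, assuming synthetic in-memory data: the class name `ToyDataset` and all sizes are hypothetical, chosen only to make the example runnable. The Dataset answers "what is sample i?", and the DataLoader decides how samples are grouped and delivered.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """The 'what': how to fetch one (sample, label) pair by index."""

    def __init__(self, n_samples: int = 10):
        # Synthetic stand-in data; a real dataset would index files on disk.
        self.features = torch.arange(n_samples, dtype=torch.float32).unsqueeze(1)
        self.labels = torch.arange(n_samples) % 2

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, index: int):
        return self.features[index], self.labels[index]


dataset = ToyDataset(10)

# The 'how': batching, shuffling, and (optionally) parallel loading.
# num_workers=0 loads in the main process; a value > 0 starts that many worker processes.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0,
                    pin_memory=torch.cuda.is_available())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_shapes = []
for features, labels in loader:
    # non_blocking=True can overlap the copy with compute only when the source is pinned memory.
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    batch_shapes.append(tuple(features.shape))

print(batch_shapes)
```

With 10 samples and `batch_size=4`, the loop yields two full batches and a final batch of 2, because `drop_last` defaults to `False`. On platforms that start worker processes with the spawn method, code using `num_workers > 0` typically needs to sit under an `if __name__ == "__main__":` guard.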
Question 1
What is the primary role of a PyTorch Dataset object?
To organize samples into mini-batches and shuffle them.
To define the logic for retrieving a single, preprocessed sample.
To perform the matrix multiplication inside the model.
Question 2
Which DataLoader parameter enables parallel loading of data using multiple CPU cores?
device_transfer
batch_size
num_workers
async_load
Question 3
If your raw images are all different sizes, which component is primarily responsible for resizing them to a uniform dimension (e.g., $224 \times 224$)?
The DataLoader's collate_fn.
The GPU's dedicated image processor.
The Transformation function applied within the Dataset's __getitem__ method.
Challenge: The Custom Image Loader Blueprint
Define the structure needed for real-world image classification.
You are building a CustomDataset for 10,000 images indexed by a single CSV file containing paths and labels.
Step 1
Which mandatory method must return the total number of samples?
Solution:
The __len__ method.
Concept: The value returned by __len__ defines the epoch size, i.e. how many samples one full pass over the data covers.
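To make the link between `__len__` and the epoch concrete, here is a small sketch. The batch size of 32 is an assumption for illustration, and a `TensorDataset` of zeros stands in for the 10,000 images.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# 10,000 placeholder samples standing in for the 10,000 images.
dataset = TensorDataset(torch.zeros(10_000, 1),
                        torch.zeros(10_000, dtype=torch.long))

loader = DataLoader(dataset, batch_size=32)
loader_drop = DataLoader(dataset, batch_size=32, drop_last=True)

print(len(dataset))      # number of samples, from the dataset's __len__
print(len(loader))       # ceil(10000 / 32) batches per epoch
print(len(loader_drop))  # floor(10000 / 32) batches when the partial batch is dropped
```

The DataLoader derives its own length from the dataset's length, so a wrong `__len__` silently changes how many batches each epoch contains.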
Step 2
What is the correct order of operations inside __getitem__(self, index)?
Solution:
1. Look up file path using index.
2. Load the raw data (e.g., Image).
3. Apply the necessary transforms.
4. Return the processed Tensor and Label.
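The four steps above can be sketched as a small CSV-indexed dataset. Several details are assumptions made so the example runs without real images: the CSV column names `path` and `label`, the `root_dir` argument, the `to_float` transform, and the use of `torch.save`/`torch.load` on `.pt` files as a stand-in for an image decoder such as `PIL.Image.open`.

```python
import csv
import os
import tempfile

import torch
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    """Classification dataset indexed by a CSV with 'path' and 'label' columns (assumed names)."""

    def __init__(self, csv_path: str, root_dir: str, transform=None):
        with open(csv_path, newline="") as f:
            self.rows = list(csv.DictReader(f))  # one dict per sample
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, index: int):
        row = self.rows[index]
        path = os.path.join(self.root_dir, row["path"])  # 1. look up the file path
        image = torch.load(path)                         # 2. load raw data (stand-in for an image decoder)
        if self.transform is not None:
            image = self.transform(image)                # 3. apply the transforms
        return image, int(row["label"])                  # 4. return the tensor and its label


# Build a tiny throwaway dataset on disk so the sketch runs end to end.
root = tempfile.mkdtemp()
csv_path = os.path.join(root, "index.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["path", "label"])
    for i in range(3):
        name = f"img_{i}.pt"
        fake_image = torch.randint(0, 256, (3, 8, 8), dtype=torch.uint8)
        torch.save(fake_image, os.path.join(root, name))
        writer.writerow([name, i % 2])


def to_float(img: torch.Tensor) -> torch.Tensor:
    """Hypothetical transform: scale uint8 pixels to floats in [0, 1]."""
    return img.float() / 255.0


dataset = CustomDataset(csv_path, root, transform=to_float)
image, label = dataset[2]
print(len(dataset), image.shape, image.dtype, label)
```

Keeping the transform as a constructor argument lets the same class serve training (with augmentation) and validation (without it) by passing a different callable.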